Proteins: Structure, Function, and Bioinformatics — Latest Matching Preprints

1

Testing the reliability of AI-generated protein structures

Xu, A.; Salzberg, S.

2026-06-13 bioinformatics 10.64898/2026.06.11.731682 medRxiv

Top 0.1%

18.3%

Although AlphaFold2 and its competitors have demonstrated remarkable abilities to predict protein structure, more work is needed to explore the limitations of these methods. Here we investigated the reliability of AlphaFold2 and ColabFold by creating a set of realistic but false protein sequences, using ColabFold to predict their structure, and then asking how often the program produces a high-scoring structure for a sequence that does not represent a protein. We determined that AlphaFold2 has a very small but non-zero false positive rate, estimated here at approximately 1 in 435 if one uses a threshold pLDDT score of 70 to define positive predictions. We also discovered, serendipitously, that some high-scoring sequences in the human genome were not false positives, but instead were previously unknown and un-annotated pseudogenes. These latter findings indicate that some well-established human annotations of protein-coding genes may have incorrectly extended the 5′ untranslated regions too far. They also suggest that AlphaFold2's false positive rate is low enough that almost any high-scoring structure, even in a noncoding region, is worthy of further investigation.
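
The false positive estimate in this abstract is a simple proportion: the share of non-protein (decoy) sequences whose predicted structure reaches the confidence threshold. A minimal Python sketch follows. It assumes a hypothetical list of mean pLDDT scores and treats "at or above 70" as a positive prediction; the scores are invented for illustration and are not the preprint's data.

```python
def false_positive_rate(decoy_plddt, threshold=70.0):
    """Fraction of decoy sequences scored at or above the pLDDT threshold.

    Both the input format and the >= comparison are assumptions made
    for this illustration.
    """
    if not decoy_plddt:
        raise ValueError("need at least one decoy score")
    hits = sum(1 for score in decoy_plddt if score >= threshold)
    return hits / len(decoy_plddt)


# Invented scores: one high-scoring decoy among 435, mirroring the
# reported rate of roughly 1 in 435.
scores = [35.0] * 434 + [82.5]
rate = false_positive_rate(scores)
print(f"false positive rate: {rate:.4%}")
```

Raising the threshold can only lower this fraction or leave it unchanged, which is why any such estimate has to be quoted together with the cutoff used.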

2

Structure Bioinformatics of Eight Human ATP Synthase Fo Subunits and Their AlphaFold3-Predicted Water-Soluble QTY Analogs

Zhang, S.; Wang, Z.; Chen, E.

2026-06-18 bioinformatics 10.64898/2026.06.18.733091 medRxiv

Top 0.1%

18.0%

Human mitochondrial ATP synthase is an essential rotary motor enzyme that produces most of the cellular ATP through oxidative phosphorylation. Its membrane-embedded Fo sector contains highly hydrophobic transmembrane subunits that are challenging to study in aqueous environments without detergents. This study explores whether applying the QTY code can reduce the hydrophobicity of selected ATP synthase Fo subunits while preserving their overall molecular structures. We applied the QTY code to eight human ATP synthase Fo subunits: ATP6, ATP8, ATPK, ATP68, ATPMK, AT5G1, AT5G2, and AT5G3. Hydrophobic amino acids leucine (L), isoleucine (I), valine (V), and phenylalanine (F) in transmembrane regions were systematically replaced with hydrophilic glutamine (Q), threonine (T), and tyrosine (Y). Four native subunits with available CryoEM structures from human ATP synthase (PDB: 8H9S) were superposed with their AlphaFold3-predicted QTY analogs. The native ATP synthase Fo subunits superposed well with their respective QTY analogs. For the CryoEM-native comparisons, RMSD values ranged from 0.565 Å to 2.546 Å. For the AlphaFold3-native comparisons of subunits without CryoEM structures, RMSD values ranged from 0.204 Å to 0.297 Å. Despite substantial QTY substitutions in the transmembrane regions, ranging from 38.89% to 50.79%, the QTY analogs retained similar overall folds, molecular weights, and isoelectric points. Hydrophobic surface analysis showed that the QTY analogs had reduced hydrophobic patches compared with their native counterparts, with average hydrophobicity decreasing from 0.2959 in native proteins to -1.1023 in QTY analogs. These structural bioinformatics studies suggest that the QTY code can be applied to ATP synthase Fo subunits to generate more hydrophilic, potentially water-soluble analogs while preserving overall structural similarity. These results extend the application of the QTY code to the membrane-embedded Fo sector of ATP synthase and provide a foundation for future experimental studies testing whether these QTY analogs can be expressed, purified, and evaluated for assembly or proton-transfer-related functions.
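
The substitution rule described in this abstract can be sketched in a few lines of Python. The mapping below follows the usual statement of the QTY code (L to Q, I and V to T, F to Y). The toy sequence, the region boundaries, and the function names are hypothetical and are not taken from the preprint.

```python
# QTY code: hydrophobic residue -> hydrophilic replacement
QTY_MAP = {"L": "Q", "I": "T", "V": "T", "F": "Y"}


def apply_qty(sequence, tm_regions):
    """Substitute L/I/V/F only inside the given transmembrane regions.

    tm_regions is a list of (start, end) 0-based, half-open index pairs;
    this input format is an assumption made for illustration.
    """
    residues = list(sequence)
    for start, end in tm_regions:
        for i in range(start, min(end, len(residues))):
            residues[i] = QTY_MAP.get(residues[i], residues[i])
    return "".join(residues)


def substitution_percent(native, analog, tm_regions):
    """Percentage of transmembrane positions that differ after substitution."""
    positions = [i for s, e in tm_regions for i in range(s, min(e, len(native)))]
    if not positions:
        raise ValueError("no transmembrane positions given")
    changed = sum(1 for i in positions if native[i] != analog[i])
    return 100.0 * changed / len(positions)


toy_native = "MKLIVFAGLLKD"   # invented 12-residue sequence
toy_regions = [(2, 10)]       # invented transmembrane span
toy_analog = apply_qty(toy_native, toy_regions)
toy_percent = substitution_percent(toy_native, toy_analog, toy_regions)
print(toy_analog, toy_percent)
```

Residues outside the listed regions are left untouched, which matches the abstract's statement that only transmembrane positions are replaced.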

3

Artificial intelligence aided design of peptides with custom secondary structure motifs and reduced amino acid alphabets

Brown, S. M.; Cohen, A. B.; Dean, S. N.

2026-05-01 bioinformatics 10.64898/2026.04.29.721096 medRxiv

Top 0.1%

12.2%

Proteins are highly diverse functional polymers where the specific sequence of amino acids, selected from a standard genetically-encoded alphabet of twenty (C20), determines the structure and ultimately the function of the resulting folded protein. This standard alphabet has been identified to be non-randomly distributed in physicochemical properties crucial to both structure-formation and function, often referred to as coverage theory. While machine learning models have drastically improved protein structure prediction, protein design has yet to have similar development. Here we therefore bridge contemporary biological theory with recent advancements in artificial intelligence (AI) to develop and evaluate a generative AI protein design model, trained on hundreds of thousands of proteins within the RCSB PDB, for custom secondary structure motifs using reduced amino acid alphabets. Results indicate an overall success in designing novel proteins with desired secondary structure motifs for a broad range of amino acid alphabets. Interestingly, this tool often captures the full three-dimensional tertiary structure of a target protein despite training only on physicochemical sequence space and DSSP secondary structure. The development of this model advances research across multiple disciplines, from general scientific AI/ML architecture development to protein design for biotechnology, astrobiology, and early-Earth evolutionary biology.

4

CCK* (Convex Closure K*): A Suite of Algorithms for De Novo L- and D-peptide Design

Childs, H.; McBride, A. C.; Donald, B. R.

2026-06-01 bioinformatics 10.1101/2025.11.21.689740 medRxiv

Top 0.1%

11.7%

The computational design of L-peptides and their mirror-image counterparts, D-peptides, is an active area in drug design. Peptide therapeutics offer exceptional structural diversity and high binding specificity, while D-peptides additionally confer critical advantages such as proteolytic resistance. Progress in de novo D-peptide design has been hindered by the absence of evolutionary context and limited structural data, both of which underpin the deep learning methods widely used in L-peptide design. Consequently, a robust framework capable of designing both L- and D-peptides should integrate data-driven inference with first-principles, physics-based modeling. Here, we introduce a unified computational framework that supports de novo design of both L- and D-peptides, thereby expanding the accessible design space across both chiral spaces. Convex Closure K* (CCK*) is a suite of chirality-agnostic algorithms: SCOPE, MONTAGE, and ARISE. SCOPE uses geometry as a proxy for chemical energetics, computing convex hull representations of rotameric states to rapidly generate multi-sequence protein contact maps. MONTAGE employs geometric hashing in conjunction with the K* algorithm to generate and rank backbone scaffolds according to their suitability for sequence design. ARISE is a K*-based sequence design algorithm that performs iterative residue assignment in an undirected graph to design high-affinity peptide sequences. We apply the full CCK* suite to six de novo design tasks, benchmarking chirality-preserving and chirality-inverting designs in both homochiral and heterochiral complexes.

5

Structural basis of half-site reactivity in the catalytic α-subunit of Class Ib ribonucleotide reductases

Yadav, L. R.; Chauhan, S. B.; Joshi, M.; Mande, S. C.

2026-06-17 biophysics 10.64898/2025.12.21.695763 medRxiv

Top 0.1%

10.4%

Ribonucleotide reductases (RNRs) employ radical chemistry to generate deoxyribonucleotides required for DNA synthesis and repair. A notable feature of RNRs is half-site reactivity, where, despite the enzyme being a symmetric α2 dimer, only one active site is catalytically active at a time while the other remains in a "poised" state for substrate binding. This phenomenon is tightly linked to the asymmetric α2β2 interaction required for radical transfer. Here, we determined cryo-EM structures of the α-subunit in the apo and holo states, i.e., the complex bound to TTP (effector) and GDP (substrate). The structures reveal asymmetric binding of the effector TTP and the substrate GDP across the dimer, with concomitant stabilization of loops surrounding the ligand-binding site. Interestingly, this asymmetry leads to well-resolved N-terminal density for ~150 residues in the substrate-bound subunit, but weak density for this region in the effector-bound monomer. N-terminal domains are unresolved in both monomers of the apo structure. Isothermal titration calorimetry supports asymmetric binding of pyrimidine effectors with micromolar affinities. Molecular dynamics simulations and three-dimensional variability analysis reveal synchronous motions of loop 2, which together with the N-terminal domain drive alternate opening and closing of the active sites in the two monomers. These conformational dynamics provide key insights into the mechanistic basis of half-site reactivity. Together, these findings provide new insights into the structural dynamics and thermodynamic principles governing regulation and half-site activity in Class Ib RNRs.

Significance statement: Ribonucleotide reductases (RNRs) are essential enzymes that supply the building blocks required for DNA synthesis and repair, yet the structural basis of their half-site reactivity has remained unclear. Using cryo-electron microscopy, calorimetry, molecular dynamics simulations, and conformational variability analysis, we show that the catalytic α-subunit of a Class Ib RNR exhibits asymmetric nucleotide binding and coordinated conformational dynamics between the two monomers. These motions drive alternating opening and closing of the active sites and are linked to differential stabilization of the N-terminal region. Our findings suggest that asymmetric conformational gating and N-terminal sampling regulate productive interaction with the radical-generating β-subunit, providing a mechanistic framework for understanding half-site reactivity and allosteric regulation in RNRs.

6

AI-derived Protein Structures Validation: AlphaFold2 Models in the Twilight Zone

Griffin, P.; Deganutti, G.; Jadeja, K.; Idigbe, C.; Pipito', L.; Mejuto, L.; Ng, C. P.; Peck, S.; Greaves, J.; Reynolds, C. A.

2026-05-12 bioinformatics 10.64898/2026.05.12.724499 medRxiv

Top 0.1%

9.9%

In any field, unquestioningly accepting artificial intelligence (AI) results should be considered bad practice. Here, we devised a comparative modelling-based strategy for validating protein structures that exploits the well-known observation that protein folds are far more conserved than protein sequences. We identify proteins with a similar fold to the AlphaFold-generated query protein and determine their structural alignment to the query. The hypothesis is that if the sequence alignment coincides with the structural alignment, then the structure is validated. The strategy is implemented on a helix-by-helix and strand-by-strand basis using a multi-template pairwise local profile alignment method that works well into the twilight zone. The method is illustrated by application to the transmembrane transporter PEPT1, for which the structure is known, and the S-deacylases ABHD13 and ABHD16A, for which only AI-generated models exist. ABHD16A is particularly challenging because a sequence alignment search with BLASTp does not reveal any structural homologues and therefore requires work with extremely remote homologues; however, both models are validated through this strategy and are stable during classical molecular dynamics simulations. The ability of the strategy to identify errors is assessed with reference to misaligned ABHD13 models and misfolded decoy proteins.

7

A Comprehensive Evaluation of Protein Structure Prediction Models for Short Peptides

Ghosh, B.; Mukherjee, A.

2026-07-03 biophysics 10.64898/2026.07.02.736085 medRxiv

Top 0.1%

9.8%

Short peptides pose distinct challenges for computational structural biology due to their lack of stable tertiary structures, high conformational flexibility, and limited evolutionary signals. To address how modern deep-learning architectures navigate these challenges, we conducted a comprehensive benchmarking of five state-of-the-art protein structure prediction models: AlphaFold2, RoseTTAFold2, ESMFold, OmegaFold, and DMPfold2. Using a curated dataset of experimentally determined short peptide structures (10-49 amino acids) from the Protein Data Bank, we systematically evaluated predictive performance across varying sequence lengths and secondary structure classes. Our results demonstrate that prediction accuracy systematically improves with peptide length. Furthermore, all models perform significantly better on α-helical and mixed-structure peptides compared to β-sheet-rich and intrinsically disordered sequences. Among the evaluated methods, AlphaFold2 and the single-sequence language models, ESMFold and OmegaFold, proved to be the most consistent and accurate overall. We also observed that internal model confidence scores are imperfectly calibrated for short peptides, necessitating cautious interpretation. Finally, by extending our analysis to the dbAMP3 dataset of uncharacterized antimicrobial peptides, we demonstrate that a multi-model consensus approach provides a rational framework for identifying robust structural hypotheses in the absence of experimental reference structures.

8

Comparison of AI protein structure ensemble prediction tools

Otten, L.; Leung, J. M. G.; Chong, L. T.; Zuckerman, D. M.

2026-05-30 biophysics 10.64898/2026.05.29.728804 medRxiv

Top 0.1%

9.4%

Multiple AI prediction tools for protein structural ensembles have recently been released, building on the much-heralded advances from AlphaFold, large language models, and other machine-learning approaches. Here we report on a comparison of several tools (BioEmu, AFSample2, ESMFlow) using a small test set of proteins, including three that exhibit well-studied structural transitions. Overall, while the AI platforms generate structurally diverse ensembles with overlapping regions, each tool produces clearly distinct conformational distributions. Thus, it is impossible that all the tools generate ensembles of high biophysical quality, analogous to a Boltzmann distribution. Experimental structures are often, but not always, covered by the ensembles in dimensionally reduced spaces. In cases where point mutations are known experimentally to cause large structural shifts, the AI tools exhibit either small or negligible shifts. Although our current analysis cannot evaluate the absolute quality of an ensemble, and hence cannot identify a best-performing AI tool, the results suggest users pursuing downstream applications such as protein engineering or drug design should interpret these ensembles with caution.

9

The Rossmann2x2 Fold Attains its Native Structure Via a Defined Pathway of Sequential and Cooperative Folding Units

Bustamante, C. J.

2026-05-22 biophysics 10.64898/2026.05.21.726993 medRxiv

Top 0.1%

7.8%

Despite progress in predicting protein structures, how proteins arrive at their native state remains a subject of continuous debate. We present a single molecule force spectroscopy study of the unfolding and refolding intermediates of the conserved, diverse, and ancient Rossmann2x2 fold (β1α2β3α4β5α6β7α8). By inserting glycines at different locations in the protein, we can follow in real time and annotate its unfolding and refolding intermediates. This protein folds along a single reversible pathway involving the ordered and sequential organization of discrete and cooperative folding units or foldons: unfolded ⇄ β1α2β3 ⇄ β1α2β3α4β5 ⇄ β1α2β3α4β5α6β7 ⇄ β1α2β3α4β5α6β7α8. This strict order results from the formation of an autonomously folding unit (primary foldon) and the subsequent organization of elements (secondary foldons) whose stability depends on their interactions with previously organized ones.

10

BioMetAll v2.0: Introducing Scores, Metal Discrimination, and Side-Chain Descriptors for Predicting Metal-Binding Sites in Proteins.

Marechal, J. D.; Fernandez Diaz, R.; Pena Losada, R.; Sanchez Aparicio, J. E.; Gao, W.; Alemany, M.

2026-07-12 bioinformatics 10.64898/2026.07.09.737562 medRxiv

Top 0.1%

7.7%

Predicting the location of metal-binding sites in proteins is crucial for fundamental biological questions and biotechnological applications. Over the past decade, the rise in metal-bound protein structures in the Protein Data Bank, combined with advanced statistical models such as deep learning, has accelerated the development of metal-binding site prediction tools. Several approaches are now available, offering high-quality benchmarks and predictive performance. Our initial development in this area is BioMetAll, whose first version was based on backbone pre-organization. Here, we introduce its second version, featuring two major updates: 1) metal-specific scoring functions and 2) prediction using backbone geometry alone or in combination with first coordination sphere descriptors. Apart from demonstrating metal sensitivity and yielding better benchmarking results, this new version allows the assessment of the influence of considering the metal's first coordination sphere versus backbone pre-organization on how metallic species bind to proteins.

11

Location dependence of protein intrinsic disorder in Drosophila melanogaster

Abdulla Daanaa, H. S.; Kuraku, S.; Akashi, H.; Saito, K.

2026-07-03 bioinformatics 10.64898/2026.07.02.732782 medRxiv

Top 0.1%

7.2%

The relevance of protein structural flexibility in function remains contested, but experimental and computational evidence continues to accumulate. Many efforts to address this investigate intrinsic disorder, which commonly refers to peptide segments or entire protein sequences that presumably lack structure and exhibit high flexibility/conformational heterogeneity under physiological conditions. These efforts face challenges such as conflicting computational predictions and ambiguous relationships among intrinsic disorder locations and other protein properties. We address these challenges at a genome-wide scale in Drosophila melanogaster using residue-level predictions for various protein properties. We employ single and consensus approaches to quantify the prevalence of intrinsic disorder and attempt to infer function by testing for differences along protein sequences. Intrinsic disorder is likely more common at termini than in internal regions, and amino acid frequencies can vary substantially between regions in a manner that plausibly reflects functions of intrinsic disorder, rather than only proteome-wide effects. Tertiary structure potentially underlies the prevalence of intrinsic disorder along protein sequences; this prevalence varies more in a putatively solvent-exposed context than in a solvent-buried one. Protein binding appears to be a main function of intrinsic disorder, and we find support consistent with the notion that structural flexibility fosters binding plasticity, and show that location and protein length are factors in this relationship. Nucleic acid binding and linker roles are ostensibly less common disorder functions than protein binding, but nucleic acid binding seems more localized at termini. Residue-level estimates of selection pressure indicate that disordered regions generally evolve under weaker sequence constraints than structured regions, except at the N-terminal region. Biases in disorder prediction are a considerable factor in many of the observations, but are unlikely to be a full explanation. The findings strengthen support for functional relevance of flexibility, offer insight into protein architecture and function, and lend impetus for experimental inquiry.

12

In Silico Structure-Based Interactomic Analysis of the Scaffolding Protein DCAF7

Mezghrani, A.; Reys, V.; Labesse, G.

2026-05-15 bioinformatics 10.64898/2026.05.13.724911 medRxiv

Top 0.1%

6.9%

WD40 domains share a widespread β-propeller fold, and often act as versatile scaffold proteins. Despite their central role in organizing dynamic cellular complexes, the molecular and structural mechanisms of many WD40 proteins remain poorly understood. Among them, DCAF7, a ubiquitously expressed and essential gene in humans, also encodes a highly conserved WD40 protein in eukaryotic organisms. It is known to interact with multiple and functionally diverse partners to coordinate the cellular activity of several protein kinases as well as transcriptional regulators, thereby modulating key cellular processes such as cell growth, differentiation, and transcriptional regulation. However, the precise mode of action of DCAF7 is unknown, and its substantial sequence divergence from better-characterized WD40 proteins prevents information transfer by similarity. Structural interactomics can reveal how protein-protein interactions (PPIs) occur within an organism, and such interactions are essential for understanding biological functions and developing new therapeutic strategies. Using SLiMAn2, AlphaFold2/3 and PSSMsearch, we identified a conserved α-helical short linear motif (SLiM) in several well-known DCAF7 partners that binds to the top surface of its β-propeller. This motif was subsequently used to generate a regular expression to identify potential new direct binders across the DCAF7 meta-interactome and the human proteome. Domain-domain interactions were also predicted for some other partners. Finally, modeling of oligomeric complexes with such new hits reveals the structural basis of DCAF7 scaffolding, with links to neurodevelopmental disorders such as autism.

13

The Gompertz curve for estimating growth rates of Protein Data Bank and protein folds

Sato, K.; Tomii, K.

2026-06-26 bioinformatics 10.64898/2026.06.24.732253 medRxiv

Top 0.1%

6.8%

The Protein Data Bank (PDB) is an ever-growing, open-access repository of structural data of biological molecules. This international database has been instrumental in the development of artificial intelligence and deep learning models for protein structure prediction and design. The PDB growth is a crucially important factor influencing further development of these models. Therefore, after analyzing the growth trend in PDB depositions since the archive's launch, we found that it is well fitted by the Gompertz function, a growth curve used across various disciplines. Furthermore, we observed that the function captures the "discovery of novel folds", i.e., the cumulative number of distinct folds among protein domains that constitute most of the PDB. Consequently, based on the fitting results, we estimated the likely numbers of PDB entries and protein folds. These findings provide insights into deceleration of growth in recent years and enable us to assess anticipated trends.
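
For readers unfamiliar with the Gompertz function named in this abstract, one common parameterization is N(t) = K·exp(-b·exp(-c·t)), where K is the upper asymptote. The Python sketch below evaluates that form and its growth rate. The parameterization and the parameter values are assumptions chosen for illustration; they are not the preprint's fitted values, and the preprint may use a different form.

```python
import math


def gompertz(t, K, b, c):
    """Cumulative count N(t) = K * exp(-b * exp(-c * t))."""
    return K * math.exp(-b * math.exp(-c * t))


def gompertz_rate(t, K, b, c):
    """Growth rate dN/dt = b * c * exp(-c * t) * N(t)."""
    return b * c * math.exp(-c * t) * gompertz(t, K, b, c)


# Hypothetical parameters: asymptote K, displacement b, rate constant c.
K, b, c = 400_000.0, 8.0, 0.05

# Growth is fastest at the inflection point t* = ln(b) / c, where N = K / e.
t_inflect = math.log(b) / c
n_inflect = gompertz(t_inflect, K, b, c)
print(f"inflection at t = {t_inflect:.1f}, N = {n_inflect:.0f}")
```

Past the inflection point the rate falls steadily while N(t) approaches K, which is the kind of deceleration the abstract refers to.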

14

Deep learning based design of buried hydrogen bond networks with HBDesigner

Dieckhaus, H.; Harvey, B. T.; Mulikova, T.; Horenstein, J. T.; Nicely, N. I.; Randolph, N. Z.; Kuhlman, B.

2026-06-11 bioengineering 10.64898/2026.06.08.730848 medRxiv

Top 0.1%

6.6%

Accurate design of hydrogen-bonding (H-bonding) interactions is a longstanding goal in protein design, as they can facilitate specific protein-protein interactions while improving the solubility of the proteins in the unbound state. Despite this, computational design of H-bond networks remains underexplored in the deep learning era. Here, we present HBDesigner, a novel algorithm for H-bond network design. Through a combination of deep learning-based sampling and atomistic energy scoring, HBDesigner outperforms existing tools in designing connected H-bond networks onto protein scaffolds. We demonstrate the usefulness of HBDesigner by creating monomeric proteins with buried polar interactions and homodimers with extended interface H-bond networks, and by installing specificity into a family of homologous heterodimers where prior design tools fail to do so. The ability to design H-bond networks into arbitrary protein scaffolds should be broadly useful for a wide range of design applications.

15

Deciphering conformational preferences of RNA in protein-RNA recognition

Kant, S.; Masipeddi, S.; Bahadur, R. P.

2026-05-15 biophysics 10.64898/2026.05.14.725147 medRxiv

Top 0.1%

6.4%

Conformational plasticity of RNAs plays important roles in recognizing RNA-binding proteins, and is often modulated by their binding partners. Here, we investigate RNA conformational preferences in a non-redundant dataset of 263 protein-RNA complexes to characterize the structural landscape associated with protein recognition. RNA dinucleotide segments are analyzed using seven backbone torsion angles (δ1, ε1, ζ1, α2, β2, γ2, and δ2), two glycosidic torsion angles (χ1 and χ2), and a pseudo-torsion angle. Focusing on dinucleotide steps present in both interface and non-interface regions, we performed density-based clustering using selected backbone torsion angles to identify recurrent conformational states. We identify 28 distinct RNA dinucleotide conformers containing at least ten members each. Among these, eight conformers represent previously unreported nucleotide conformers (NtCs), including the transitional and the non-canonical states AB06, AB07, BB21, BB22, OP32, OP33, IC08 and IC09. Several of these conformers are preferentially enriched at protein-binding interfaces, suggesting their involvement in local conformational adaptation during protein-RNA recognition. The newly identified conformers span transitional A-B geometries, distorted B-like states, open conformations and compact intercalated structures, highlighting the remarkable structural plasticity of RNA in ribonucleoprotein complexes. Overall, this study expands the current understanding of RNA conformational space and provides a refined RNA dinucleotide conformer library for protein-RNA complexes. These findings will facilitate the identification of novel RNA structural motifs and improve RNA structural modeling, docking of protein-RNA complexes, and deep learning-based prediction frameworks for describing RNA tertiary structures.

16

Mechanistic Interpretability for Protein Language Models: A Validation Framework

Chon, P.; Andreopoulos, W. B.

2026-06-02 bioinformatics 10.64898/2026.05.29.727021 medRxiv

Top 0.2%

6.2%

Protein language models (PLMs) have been shown to be powerful predictors of protein structure and function, but their internal mechanisms remain poorly understood. Recent mechanistic interpretability methods have decomposed PLM representations into interpretable features, but they have not combined methods on a single biologically meaningful task. This paper tests whether an InterPLM sparse autoencoder and a ProtoMech cross-layer transcoder can discover features in ESM-2 (6 layers, 8M) that discriminate mainly between Class A β-lactamase and Class B β-lactamase, with Classes C and D used as more challenging comparisons. The main goal is to find distinct features for Class A β-lactamase that are not shared by other classes. We find that both methods find distinct features for Class A β-lactamase, but the cross-layer transcoders show that the concepts for Class A β-lactamase seem to be distributed among nodes, such as those in layers 4 and 6, rather than concentrated in one node. We also showcase a validation framework to prevent overclaiming the role of a node, and we use it to show that several strong nodes fail in some stages of the framework, meaning that they cannot be the sole node that defines Class A β-lactamase.

17

The Hidden Disorder Divide: Reconciling Benchmark Inconsistencies in Intrinsically Disordered Protein Binding Site Prediction

Malhis, N.; Mehdiabadi, M.; Erdos, G.; Gsponer, J.; Kurgan, L.; Tosatto, S. C. E.; Dosztanyi, Z.; Piovesan, D.

2026-06-27 bioinformatics 10.64898/2026.06.24.733783 medRxiv

Top 0.2%

5.6%

Computational predictors of protein-binding sites within intrinsically disordered regions (IDRs) show highly inconsistent performance across high-quality benchmark datasets. To understand the origins of these discrepancies, we systematically compared predictors across three independent test sets: two CAID datasets updated with the latest DisProt annotations and a composite dataset (DBs) assembled from DIBS, FuzDB, IDEAL, and MFIB. Predictors trained predominantly on DisProt data achieved substantially higher AUCs on the CAID sets but performed poorly on the DBs. In contrast, predictors trained on older, low-quality PDB-based datasets showed balanced performance across all sets, with a slight preference for DBs. Predictors with mixed training exposure displayed intermediate behavior. Through controlled experiments using identical CNN architectures and feature analysis, we demonstrate that the dominant factor driving these performance differences is the intrinsic disorder propensity of the binding sites themselves. Binding residues in DisProt-based datasets exhibit markedly higher average disorder propensity scores than those in PDB-derived datasets. This previously unrecognized selection bias -- literature studies preferentially characterizing more disordered binding sites, while PDB-derived annotations capture less disordered ones -- effectively splits IDR-protein binding sites into two distinct categories. Predictors optimized on one category therefore generalize poorly to the other. Binding-site length and sequence conservation play only minor or negligible roles in explaining the observed inconsistencies. These findings highlight a critical limitation in current benchmarking practices and training strategies for IDR-binding site prediction, underscoring the need for more balanced and disorder-aware reference datasets. Finally, the diagnostic techniques introduced here could prove valuable beyond the specific application examined in this study.

18

AlphaFold3 predicted LWO G-protein complex from European robin features active-state biased Gα

Hungerland, J.; Kostritski, A.; Koch, K.-W.; Solov'yov, I.

2026-05-20 biophysics 10.64898/2026.05.19.726335 medRxiv

Top 0.2%

5.5%

Avian phototransduction and magnetoreception have been proposed to involve shared retinal proteins, including interactions between long-wavelength opsin (LWO), the cone-specific heterotrimeric G protein (Gt), and cryptochrome 4a (Cry4a), yet structural information on avian phototransduction complexes is lacking. Here we present and critically assess two atomistic models of the European robin LWO-Gt complex generated by distinct modelling strategies. A full-complex prediction using AlphaFold3 yields a tightly packed, structurally stable interface but exhibits pronounced activation-like conformational features of the Gtα-subunit that persist in simulations of the isolated protein, revealing a strong bias toward the active state. In contrast, a template-guided assembly based on single-chain predictions and an experimental rhodopsin-Gt reference structure forms a weaker interface and shows no intrinsic activation bias, while still displaying subtle activation-related dynamics. These results demonstrate that machine-learned complex prediction can encode functional states independently of the local interaction environment, thereby limiting its interpretability for signalling mechanisms that hinge on activation equilibria. Our findings highlight the need for explicit assessment of conformational-state bias when modelling regulatory protein assemblies and provide a structural framework for future studies of Cry4a-dependent modulation of retinal G-protein signalling in avian magnetoreception.

19

Direct Binding of Cysteine-367 Thiolate to the Active Site of the [FeFe]-Hydrogenase from Clostridium beijerinckii in the O2-stable State

Duan, J.; Arrigoni, F.; Rutz, A.; Hofmann, E.; Greco, C.; Happe, T.

2026-07-13 biochemistry 10.64898/2026.07.11.737921 medRxiv

Top 0.2%

5.4%

[FeFe]-hydrogenases are highly active biocatalysts for H2 conversion. However, their active site is vulnerable to irreversible degradation initiated by O2 binding at the catalytic iron ion (Fed) of the active center. CbA5H, the [FeFe]-hydrogenase from Clostridium beijerinckii, is stable towards oxygen (O2) because it can reversibly enter an inactive state, termed Hinact, upon contact with O2. We previously proposed, based on a 2.9 Å crystal structure of CbA5H in the Hinact state, that the short distance of approximately 3.1 Å between the thiol of a nearby cysteine (C367) and the Fed enables the two to bind to each other. This binding was therefore suggested to shield the Fed from O2 damage. However, evidence supporting this hypothesis has been lacking. Furthermore, density functional theory (DFT) calculations based on a homologous model favored hydroxide over the thiol of C367 as the ligand bound to the Fed. In this study, we present the crystal structure of CbA5H in the Hinact state at an improved resolution of 2.15 Å. The structure reveals direct binding between the thiol of C367 and the Fed at a distance of approximately 2.77 Å, which is well supported by our DFT calculations based on the new crystallographic data. Notably, the 2.77 Å bond distance is strikingly long compared with other iron-sulfur bonds. This finding may provide a crucial foundation for understanding the rapid reversibility of the Hinact state.

20

Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics 10.64898/2026.05.15.725340 medRxiv

Top 0.2%

5.4%

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, which reflects how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using the hybrid representation. The comparison showed that incorporating latent features derived from the one-dimensional convolutional neural network improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean area under the receiver operating characteristic curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.

Author Summary: Protein-protein interactions are fundamental to all biological processes. However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model that automatically learned features of the primary sequence, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." We then systematically compared frequency-only features with hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among the four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
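The hybrid feature construction described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `aa_frequencies` and `pair_features`, the choice of the 20 standard residues, and the concatenation order are all assumptions made for illustration, and the latent vectors stand in for whatever the autoencoder would produce.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order (an assumption of this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence):
    """Return the relative frequency of each standard amino acid in a sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues found: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(seq_a, seq_b, latent_a=(), latent_b=()):
    """Build one feature vector for a protein pair.

    latent_a and latent_b are hypothetical placeholders for learned
    embeddings; leaving them empty gives the frequency-only representation.
    """
    return (aa_frequencies(seq_a) + list(latent_a)
            + aa_frequencies(seq_b) + list(latent_b))


# Frequency-only features (40 values) versus hybrid features for one pair.
frequency_only = pair_features("MKTAYIAKQR", "GSHMLEDPVA")
hybrid = pair_features("MKTAYIAKQR", "GSHMLEDPVA", [0.12, -0.40], [0.05, 0.33])
```

Vectors built this way could then be passed to any classifier, such as a random forest, to compare the frequency-only and hybrid representations.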